Abstract

Most of computer science focuses on automatically solving given computational problems. I focus on 
automatically inventing or discovering problems in a way inspired by the playful behavior of animals and 
humans, to train a more and more general problem solver from scratch in an unsupervised fashion. Con- 
sider the infinite set of all computable descriptions of tasks with possibly computable solutions. Given a 
general problem solving architecture, at any given time, the novel algorithmic framework PowerPlay
[46] searches the space of possible pairs of new tasks and modifications of the current problem solver,
until it finds a more powerful problem solver that provably solves all previously learned tasks plus the
new one, while the unmodified predecessor does not. Newly invented tasks may require achieving a
wow-effect by making previously learned skills more efficient such that they require less time and space.
New skills may (partially) re-use previously learned skills. The greedy search of typical PowerPlay 
variants uses time-optimal program search to order candidate pairs of tasks and solver modifications by 
their conditional computational (time & space) complexity, given the stored experience so far. The new 
task and its corresponding task-solving skill are those first found and validated. This biases the search 
towards pairs that can be described compactly and validated quickly. The computational costs of validat- 
ing new tasks need not grow with task repertoire size. Standard problem solver architectures of personal 
computers or neural networks tend to generalize by solving numerous tasks outside the self-invented 
training set; PowerPlay's ongoing search for novelty keeps breaking the generalization abilities of 
its present solver. This is related to Gödel's sequence of increasingly powerful formal theories based
on adding formerly unprovable statements to the axioms without affecting previously provable theorems. 
The continually increasing repertoire of problem solving procedures can be exploited by a parallel search 
for solutions to additional externally posed tasks. PowerPlay may be viewed as a greedy but practical 
implementation of basic principles of creativity [42, 45]. A first experimental analysis can be found in
separate papers [53, 52].
1 Introduction



Given a realistic piece of computational hardware with specific resource limitations, how can one devise 
software for it that will solve all, or at least many, of the a priori unknown tasks that are in principle easily 
solvable on this architecture? In other words, how to build a practical general problem solver, given the 
computational restrictions? It does not need to be universal and asymptotically optimal [17, 13, 41, 44] like
the recent (not necessarily practically feasible) general problem solvers discussed in Section 9.1; instead it
should take into account all constant architecture-specific slowdowns ignored in the asymptotic optimality 
notation of theoretical computer science, and be generally useful for real-world applications. 

Let us draw inspiration from biology. How do initially helpless human babies become rather general 
problem solvers over time? Apparently by playing. For example, even in the absence of external reward or 
hunger they are curious about what happens if they move their eyes or fingers in particular ways, creating 
little experiments which lead to initially novel and surprising but eventually predictable sensory inputs, 
while also learning motor skills to reproduce these outcomes. (See [31, 30, 37, 42, 45, 62] and Section
9.3 for previous artificial systems of this type.) Infants continually seem to invent new tasks that become
boring as soon as their solutions become known. Easy-to-learn new tasks are preferred over unsolvable or 
hard-to-learn tasks. Eventually the numerous skills acquired in this creative, self-supervised way may get 
re-used to facilitate the search for solutions to external problems, such as finding food when hungry. While 
kids keep inventing new problems for themselves, they move through remarkable developmental stages 

nmmiii]. 

Here I introduce a novel unsupervised algorithmic framework for training a computational problem 
solver from scratch, continually searching for the simplest (fastest to find) combination of task and
corresponding task-solving skill to add to its growing repertoire, without forgetting any previous skills
(Section 2), or at least without decreasing average performance on previously solved tasks (Section 7.1). New skills
may (partially) re-use previously learned skills. Every new task added to the repertoire is essentially de- 
fined by the time required to invent it, to solve it, and to demonstrate that no previously learned skills got 
lost. The search takes into account that typical problem solvers may learn to solve tasks outside the growing 
self-made training set due to generalization properties of their architectures. The framework is called
PowerPlay because it continually [25] aims at boosting computational prowess and problem solving capacity,
reminiscent of humans or human societies trying to boost their general power/capabilities/knowledge/skills 
in playful ways, even in the absence of externally defined goals, although the skills learned by this type of 
pure curiosity may later help to solve externally posed tasks. 

Unlike our first implementations of curious/creative/playful agents from the 1990s Il30ll54ll37l (Section 
19.31 compare HI EJUS 120 1), PowerPlay provably (by design) does not have any problems with online 
learning — it cannot forget previously learned skills, automatically segmenting its life into a sequence of 
clearly identified tasks with explicitly recorded solutions. Unlike the task search of theoretically optimal 
creative agents [42, 45] (Section 9.3), PowerPlay's task search is greedy, but at least practically feasible.

Some claim that scientists often invent appropriate problems for their methods, rather than inventing
methods to solve given problems. The present paper formalizes this in a way that may be more convenient
to implement than those of previous work [30, 37, 42, 45], and describes a simple practical framework for
building creative artificial scientists or explorers that by design continually come up with the fastest to find, 
initially novel, but eventually solvable problems. 

1.1 Basic Ideas 

In traditional computer science, given some formally defined task, a search algorithm is used to search a 
space of solution candidates until a solution to the task is found and verified. If the task is hard the search 
may take long. 

To automatically construct an increasingly general problem solver, let us expand the traditional search 
space in an unusual way, such that it includes all possible pairs of computable tasks with possibly com- 
putable solutions, and problem solvers. Given an old problem solver that can already solve a finite known 
set of previously learned tasks, a search algorithm is used to find a new pair that provably has the following 
properties: (1) The new task cannot be solved by the old problem solver. (2) The new task can be solved by 
the new problem solver (some modification of the old one). (3) The new solver can still solve the known
set of previously learned tasks. 

Once such a pair is found, the cycle repeats itself. This will result in a continually growing set of known
tasks solvable by an increasingly powerful problem solver. Solutions to new tasks may (partially)
re-use solutions to previously learned tasks.

Smart search (e.g., Section 4.1, Algorithm 4.1) orders candidate pairs of the type (task, solver) by
computational complexity, using concepts of optimal universal search [17, 41], with a bias towards pairs
that can be described by few additional bits of information (given the experience so far) and that can be 
validated quickly. 

At first glance it might seem harder to search for pairs of tasks and solvers instead of solvers only, due 
to the apparently larger search space. However, the additional freedom of inventing the tasks to be solved 
may actually greatly reduce the time intervals between problem solver advances, because the system may 
often have the option of inventing a rather simple task with an easy-to-find solution. 

A new task may be about simplifying the old solver such that it can still solve all tasks learned so far, 
but with fewer computational resources such as time and storage space (e.g., Section 7.1 and Algorithm 7.1).

Since the new pair (task, solver) is the first one found and validated, the search automatically trades
off the time-varying efforts required either to invent completely new, previously unsolvable problems, or
to compress or speed up previous solutions. Sometimes it is easier to refine or simplify known skills,
sometimes to invent new skills.

On typical problem solver architectures of personal computers (PCs) or neural networks (NNs), while 
a limited known number of previously learned tasks has become solvable, so too has a large number of 
unknown, never-tested tasks (in the field of Machine Learning, this is known as generalization).
PowerPlay's ongoing search is continually testing (and always trying to go beyond) the generalization abilities of
the most recent solver instance; some of its search time has to be spent on demonstrating that self-invented 
new tasks are not already solvable. 

Often, however, much more time will have to be spent on making sure that a newly modified solver did 
not forget any of the possibly many previously learned skills. Problem solver modularization (Section 3.3,
especially 3.3.2) may greatly reduce this time though, making PowerPlay prefer pairs whose validation
does not require the re-testing of too many previously learned skills, thus decomposing at least part of the 
search space into somewhat independent regions, realizing divide and conquer strategies as by-products of 
its built-in drive to invent and validate novel tasks/skills as quickly as possible. 

A biologically inspired hope is that as the problem solver is becoming more and more general, it will 
find it easier and easier to solve externally posed tasks (Section 6), just like growing infants often seem to
re-use their playfully acquired skills to solve teacher-given problems. 

1.2 Outline of Remainder 

Section 2 will introduce basic notation and Variant I of the algorithmic framework PowerPlay, which
invokes the essential procedures Task Invention, Solver Modification, and Correctness
Demonstration. Section 3 will discuss details of these procedures.

More detailed instantiations of PowerPlay will be described in Section 4.3 (an evolutionary method,
Alg. 4.3) and Section 4.1 (an asymptotically optimal program search method, Alg. 4.1).

As mentioned above, the skills acquired to solve self-generated tasks may later greatly facilitate
solutions to externally posed tasks, just like the numerous motor skills learned by babies during curious
exploration of their world often can be re-used later to maximize external reward. Sections 6 and 7.1 will
discuss variants of the framework (e.g., Algorithm 7.1) in which some of the tasks can be defined externally.

Section 7.1 will also describe a natural variant of the framework that explicitly penalizes solution costs
(including time and space complexity), and allows for forgetting aspects of previous solutions, provided 
the average performance on previously solved tasks does not decrease. 

Section 8 will point to illustrative experiments in separate papers [53, 52]. Section 9 will discuss
relations to previous work. 
2 Notation & Algorithmic Framework PowerPlay (Variant I)



$B^*$ denotes the set of finite sequences or bitstrings over the binary alphabet $B = \{0, 1\}$, $\lambda$ the empty
string, $x, y, z, p, q, r, u$ strings in $B^*$, $\mathbb{N}$ the natural numbers, $\mathbb{R}$ the real numbers, $\epsilon \in \mathbb{R}$ a positive constant,
$m, n, n_0, k, i, j, l$ non-negative integers, $L(x)$ the number of bits in $x$ (where $L(\lambda) = 0$), $f, g$ functions
mapping integers to integers. We write $f(n) = O(g(n))$ if there exist positive $c, n_0$ such that $f(n) \le c\,g(n)$
for all $n > n_0$.

The computational architecture of the problem solver may be a deterministic universal computer, or a 
more limited device such as a finite state automaton or a feedforward neural network (NN) [3]. All such
problem solvers can be uniquely encoded fQ] or implemented on universal computers such as universal 
Turing Machines (TM) [56]. Therefore, without loss of generality, the remainder of this paper assumes a
fixed universal reference computer whose input programs and outputs are elements of $B^*$. A user-defined
subset $\mathcal{S} \subset B^*$ defines the set of possible problem solvers. For example, if the problem solver's architecture
is itself a binary universal TM or a standard computer, then $\mathcal{S}$ represents its set of possible programs, or
a limited subset thereof (compare Sections 3.2 and 4.1). If it is a feedforward NN, then $\mathcal{S}$ could be a
highly restricted subset of programs encoding the NN's possible topologies and weights (floating point
numbers); compare Section 8 and the original SLIM NN paper [47].

In what follows, for convenience I will often identify bitstrings in B* with things they encode, such 
as integers, real-valued vectors, weight matrices, or programs — the context will always make clear what is 
meant. 

The problem solver's initial program is called $s_0$. There is a set of possible task descriptions $\mathcal{T} \subset B^*$. $\mathcal{T}$
may be the infinite set of all possible computable descriptions of tasks with possibly computable solutions,
or just a small subset thereof. For example, a simple task may require the solver to answer a particular input
pattern with a particular output pattern (more formal details on pattern recognition tasks are given in Section
3.1.1). Or it may require the solver to steer a robot towards a goal through a sequence of actions (more
formal details on sequential decision making tasks in unknown environments are given in Section 3.1.2).
There is a particular sequence of task descriptions $T_1, T_2, \ldots$, where each unique $T_i \in \mathcal{T}$ ($i = 1, 2, \ldots$)
is chosen or "invented" by a search method described below such that the solutions of $T_1, T_2, \ldots, T_i$ can
be computed by $s_i$, the $i$-th instance of the program, but not by $s_{i-1}$ ($i = 1, 2, \ldots$). Each $T_i$ consists
of a unique problem identifier that can be read by $s_i$ through some built-in input processing mechanism
(e.g., input neurons of an NN [47]), and a unique description of a deterministic procedure for determining
whether the problem has been solved. Denote $T_{\le i} = \{T_1, \ldots, T_i\}$; $T_{<i} = \{T_1, \ldots, T_{i-1}\}$.

A valid task $T_i$ ($i > 1$) may require solving at least one previously solved task $T_k$ ($k < i$) more
efficiently, by using fewer resources such as storage space, computation time, energy, etc., thus achieving a
wow-effect. See Section 7.1.

Tasks and problem solver modifications are computed and validated by elements of another appropriate
set of programs $\mathcal{P} \subset B^*$. Programs $p \in \mathcal{P}$ may contain instructions for reading and executing (parts of)
the code of the present problem solver and reading (parts of) a recorded history $Trace \in B^*$ of previous
events that led to the present solver. The algorithmic framework (Alg. 2) incrementally trains the problem
solver by finding $p \in \mathcal{P}$ that increase the set of solvable tasks.

3 Task Invention, Solver Modification, Correctness Demo 

A program tested by Alg. 2 has to allocate its runtime to solve three main jobs, namely, Task Invention,
Solver Modification, and Correctness Demonstration. Examples of each are listed next.

3.1 Implementing Task Invention 

Part of the job of $p_i \in \mathcal{P}$ is to compute $T_i \in \mathcal{T}$. This will consume some of the total computation
time allocated to $p_i$. Two examples will be given: pattern recognition tasks are treated in Section 3.1.1,
sequential decision making tasks in Section 3.1.2.
Alg. 2 Algorithmic Framework PowerPlay (Variant I)

Initialize $s_0$ in some way.
for i := 1, 2, . . . do
    repeat
        Let a search algorithm (examples in Section 4) create a new candidate program $p \in \mathcal{P}$.
        Give p limited time to do (not necessarily in this order):
        * Task Invention: Let p compute a task $T \in \mathcal{T}$. See Section 3.1.
        * Solver Modification: Let p compute a value of the variable $q \in \mathcal{S} \subset B^*$ (a candidate for $s_i$)
          by computing a modification of $s_{i-1}$. See Section 3.2.
        * Correctness Demonstration: Let p try to show that T cannot be solved by $s_{i-1}$, but that
          T and all $T_k$ ($k < i$) can be solved by q. See Section 3.3.
    until Correctness Demonstration was successful
    Set $p_i := p$; $T_i := T$; $s_i := q$; update Trace.
end for
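The loop of Alg. 2 can be illustrated with a deliberately tiny sketch. Everything concrete here is an assumption made for illustration and is not part of the framework: the solver is a lookup table, a task is an (input, expected output) pair, the hypothetical `candidates` generator stands in for the search algorithm, and Correctness Demonstration is done by naive re-testing of all stored tasks.

```python
from itertools import count

def solves(solver, task):
    """Toy notion of solving: the solver (a dict) maps the task input to the expected output."""
    x, y = task
    return solver.get(x) == y

def candidates(solver):
    """Hypothetical search algorithm: propose (task, modified solver) pairs in a fixed order."""
    for n in count():
        task = (n, n * n)          # Task Invention: "map n to n squared"
        q = dict(solver)           # Solver Modification: copy s_{i-1} ...
        q[n] = n * n               # ... and change a single entry
        yield task, q

def powerplay(steps):
    solver, repertoire = {}, []    # s_0 and the initially empty task list
    for _ in range(steps):
        for task, q in candidates(solver):
            # Correctness Demonstration by naive re-testing:
            novel = not solves(solver, task)                    # old solver fails on T
            keeps_old = all(solves(q, t) for t in repertoire)   # q still solves T_k, k < i
            if novel and keeps_old and solves(q, task):
                solver = q
                repertoire.append(task)
                break
    return solver, repertoire

solver, repertoire = powerplay(3)
```

In this toy setting the first accepted pair is always the first proposal that the old solver cannot already handle, which mirrors the "first found and validated" rule of the framework.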



3.1.1 Example: Pattern Recognition Tasks 

In the context of learning to recognize or analyze patterns, $T_i$ could be a 4-tuple $(I_i, O_i, t_i, n_i) \in \mathcal{I} \times
\mathcal{O} \times \mathbb{N} \times \mathbb{N}$, where $\mathcal{I}, \mathcal{O} \subset B^*$, and $T_i$ is solved if $s_i$ satisfies $L(s_i) < n_i$ and needs at most $t_i$ discrete
time steps to read $I_i$ and compute $O_i$ and halt. Here $I_i$ itself may be a pair $(I_i^1, I_i^2) \in B^* \times B^*$, where
$I_i^1$ is constrained to be the address of an image in a given database of patterns, and $I_i^2$ is a $p$-generated
"query" that uniquely specifies how the image should be classified through target pattern $O_i$, such that
the same image can be analyzed in different ways during different tasks. For example, depending on the
nature of the invented task sequence, the problem solver could eventually learn that $O = 1$ if $I^2 = 1001$
(suppressing task indices) and the image addressed by $I^1$ contains at least one black pixel, or if $I^2 = 0111$
and the image shows a cow.
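As a rough illustration of the solved-condition for such a 4-tuple, the sketch below checks the size bound and the time bound for one task. The names `PatternTask`, `Solver`, `is_solved` and `copy_bits` are hypothetical, and the step model is an assumption: a solver is a Python generator that yields its partial output once per discrete time step, with the last yielded value taken as its output on halting.

```python
from typing import Callable, Iterator, NamedTuple

class PatternTask(NamedTuple):
    inp: str      # I_i, a bitstring
    target: str   # O_i, a bitstring
    t_max: int    # t_i, bound on discrete time steps
    n_max: int    # n_i, strict bound on solver size in bits

class Solver(NamedTuple):
    code: str                              # bitstring encoding of s_i, so L(s_i) = len(code)
    run: Callable[[str], Iterator[str]]    # one yield per time step; last yield is the output

def is_solved(task: PatternTask, solver: Solver) -> bool:
    if not len(solver.code) < task.n_max:  # size bound L(s_i) < n_i
        return False
    steps, out = 0, None
    for out in solver.run(task.inp):
        steps += 1
        if steps > task.t_max:             # time bound exceeded
            return False
    return out == task.target

def copy_bits(inp: str) -> Iterator[str]:
    """Toy solver behaviour: copy the input, one bit per time step."""
    out = ""
    for bit in inp:
        out += bit
        yield out

solver = Solver(code="0" * 8, run=copy_bits)
print(is_solved(PatternTask("1001", "1001", t_max=4, n_max=16), solver))
print(is_solved(PatternTask("1001", "1001", t_max=3, n_max=16), solver))
```

Tightening `t_max` or `n_max` on an already solved task gives a new task that the same solver no longer passes, which is the mechanism behind the efficiency-oriented tasks discussed next.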

Since the definition of task $T_i$ includes bounds $n_i, t_i$ on computational resources, $T_i$ may be about
solving at least one $T_k$ ($k < i$) more efficiently, corresponding to a wow-effect. This in turn may also yield
more efficient solutions to other tasks $T_l$ ($l < i$, $l \neq k$). In practical applications one may insist that such
efficiency gains must exceed a certain threshold $\epsilon > 0$, to avoid task series causing sequences of very minor
improvements.

Note that $n_i$ and $t_i$ may be unnecessary in special cases such as the problem solver being a fixed
topology feedforward NN whose input and target patterns have constant size and whose computational 
efforts per pattern need constant time and space resources. 

Assuming sufficiently powerful $\mathcal{S}, \mathcal{P}$, in the beginning, trivial tasks such as simply copying $I_i^2$ onto $O_i$
may be interesting in the sense that PowerPlay can still validate and accept them, but they will become 
boring (inadmissible) as soon as they are solvable by solutions to previous tasks that generalize to new 
tasks. 

3.1.2 Example: General Decision Making Tasks in Dynamic Environments 

In the more general context of general problem solving/sequential decision making/reinforcement
learning/reward optimization [21, 15, 55] in unknown environments, there may be a set $\mathcal{I} \subset B^*$ of possible task
identification patterns and a set $\mathcal{J} \subset B^*$ of programs that test properties of bitstrings. $T_i$ could then encode
a 4-tuple $(I_i, J_i, t_i, n_i) \in \mathcal{I} \times \mathcal{J} \times \mathbb{N} \times \mathbb{N}$ of finite bitstrings with the following interpretation: $s_i$ must
satisfy $L(s_i) < n_i$ and may spend at most $t_i$ discrete time steps on first reading $I_i$ and then interacting with
an environment through a sequence of perceptions and actions, to achieve some computable goal defined
by $J_i$.

More precisely, while $T_i$ is being solved within $t_i$ time steps, at any given time $1 \le t \le t_i$, the internal
state of the problem solver at time $t$ is denoted $u_i(t) \in B^*$; its initial default value is $u_i(0)$. For example,
$u_i(t)$ may encode the current contents of the internal tape of a TM, or of certain addresses in the dynamic
storage area of a PC, or the present activations of an LSTM recurrent NN [12]. At time $t$, $s_i$ can spend
a constant number of elementary computational instructions to copy the task description $T_i$ or the present
environmental input $x_i(t) \in B^*$ and a reward signal $r_i(t) \in B^*$ (interpreted as a real number) into parts of
$u_i(t)$, then update other parts of $u_i(t)$ (a function of $u_i(t-1)$) and compute action $y_i(t) \in B^*$ encoded as
a part of $u_i(t)$. $y_i(t)$ may affect the environment, and thus future inputs.

If $\mathcal{P}$ allows for programs that can dynamically acquire additional physical computational resources
such as additional CPUs and storage, then the above constant number of elementary computational
instructions should be replaced by a constant amount of real time, to be measured by a reliable physical
clock.

The sequence of 4-tuples $(x_i(t), r_i(t), u_i(t), y_i(t))$ ($t = 1, \ldots, t_i$) gets recorded by the so-called trace
$Trace_i \in B^*$. If at the end of the interaction a desirable computable property $J_i(Trace_i)$ (computed by
applying program $J_i$ to $Trace_i$) is satisfied, then by definition the task is solved. The set $\mathcal{J}$ of possible
$J_i$ may represent an infinite set of all computable tasks with solutions computable by the given hardware.
For practical reasons, however, the set $\mathcal{J}$ of possible $J_i$ may also be restricted to bit sequences encoding
just a few possible goals. For example, $J_i$ may only encode goals of the form: a robot arm steered by
program or "policy" $s_i$ has reached a certain target (a desired final observation $x_i(t_i)$ recorded in $Trace_i$)
without measurably bumping into an obstacle along the way, that is, there were no negative rewards, that
is, $r_i(\tau) \ge 0$ for $\tau = 1, \ldots, t_i$.

If the environment is deterministic, e.g., a digital physics simulation of a robot, then its current state
can be encoded as part of $u(t)$, and it is straightforward for Correctness Demonstration to test
whether some $s_i$ still can solve a previously solved task $T_j$ ($j < i$). However, what if the environment is
only partially observable, like the real world, and non-stationary, changing in unknown ways? Then
Correctness Demonstration must check whether $s_i$ still produces the same action sequence in response
to the input sequence recorded in $Trace_j$ (often this replay-based test will actually be computationally
cheaper than a test involving the environment). Achieving the same goal in a changed environment must
be considered a different task, even if the changes are just due to noise on the environmental inputs. (Sure,
in the real world $s_j$ ($j > i$) might actually achieve $J_i$ faster than $s_i$, given the description of $T_i$, but
Correctness Demonstration in general cannot know whether this acceleration was due to plain luck; it
must stick to reproducing $Trace_i$ to make sure it did not forget anything.)
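The replay-based test can be sketched as follows. This is only an illustration under assumptions not fixed by the text: a trace is a Python list of (x, r, u, y) tuples, and a policy is a hypothetical function mapping (internal state, input, reward) to (new internal state, action).

```python
def replay_matches(policy, initial_state, trace):
    """Feed the recorded inputs and rewards to the policy and compare its actions
    with the recorded ones; the environment itself is never consulted."""
    state = initial_state
    for x, r, _recorded_state, y_recorded in trace:
        state, y = policy(state, x, r)
        if y != y_recorded:
            return False
    return True

def parity_policy(state, x, r):
    """Toy policy: the action is the running parity of the inputs seen so far."""
    new_state = state ^ x
    return new_state, new_state

def constant_policy(state, x, r):
    """Toy modified policy that has 'forgotten' the parity skill."""
    return state, 1

# Recorded trace of (x, r, u, y) tuples produced earlier by parity_policy.
trace = [(1, 0.0, 1, 1), (1, 0.0, 0, 0), (0, 0.0, 0, 0)]

print(replay_matches(parity_policy, 0, trace))
print(replay_matches(constant_policy, 0, trace))
```

Only the recorded actions are compared here; a stricter variant could also compare the recorded internal states.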

See Section 7.2, however, for a less strict PowerPlay variant whose Correctness Demonstration
directly interacts with the real world to collect sufficient problem-solving statistics through repeated
trials, making certain assumptions about the probabilistic nature of the environment, and the repeatability
of experiments.

3.2 Implementing Solver Modification 

Part of the job of $p_i \in \mathcal{P}$ is also to compute $s_i$, possibly profiting from having access to $s_{i-1}$, because only
few changes of $s_{i-1}$ may be necessary to come up with an $s_i$ that goes beyond $s_{i-1}$. For example, if the
problem solver is a standard PC, then just a few bits of the previous software $s_{i-1}$ may need to be changed.

For practical reasons, the set $\mathcal{S}$ of possible $s_i$ may be greatly restricted to bit sequences encoding
programs that obey the syntax of a standard programming language such as LISP or Java. In turn, the
programming language describing $\mathcal{P}$ should be greatly restricted such that any $p_i \in \mathcal{P}$ can only produce
syntactically correct $s_i$.

If the problem solver is a feedforward NN with pre-wired, unmodifiable topology, then $\mathcal{S}$ will be
restricted to those bit sequences encoding valid weight matrices, $s_i$ will encode its $i$-th weight matrix, and $\mathcal{P}$
will be restricted to those $p \in \mathcal{P}$ that can produce some $s_i \in \mathcal{S}$. Depending on the user-defined
programming language, $p_i$ may invoke complex pre-wired subprograms (e.g., well-known learning algorithms) as
primitive instructions; compare the separate experimental analysis [53, 52].

In general, $p$ itself determines how much time to spend on Solver Modification; enough time
must be left for Task Invention and Correctness Demonstration.

3.3 Implementing Correctness Demonstration 

Correctness Demonstration may be the most time-consuming obligation of $p_i$. At first glance it may seem
that as the sequence $T_1, T_2, \ldots$ is growing, more and more time will be needed to show that $s_i$ but not $s_{i-1}$
can solve $T_1, T_2, \ldots, T_i$, because one naive way of ensuring correctness is to re-test $s_i$ on all previously
solved tasks. Theoretically more efficient ways are considered next.

3.3.1 Most General: Proof Search 

The most general way of demonstrating correctness is to encode (in read-only storage) an axiomatic system
$A$ that formally describes computational properties of the problem solver and possible $s_i$, and to allow $p_i$ to
search the space of possible proofs derivable from $A$, using a proof searcher subroutine that systematically
generates proofs until it finds a theorem stating that $s_i$ but not $s_{i-1}$ solves $T_1, T_2, \ldots, T_i$ (proof search
may achieve this efficiently without explicitly re-testing $s_i$ on $T_1, T_2, \ldots, T_i$). This could be done like
in the Gödel Machine [44] (Section 9.1), which uses an online extension of Universal Search [17] to
systematically test proof techniques: proof-generating programs that may invoke special instructions for
generating axioms and applying inference rules to prolong an initially empty $proof \in B^*$ by theorems,
which are either axioms or inferred from previous theorems through rules such as modus ponens combined
with unification, e.g., Q. $\mathcal{P}$ can be easily limited to programs generating only syntactically correct proofs
[44]. $A$ has to subsume axioms describing how any instruction invoked by some $s \in \mathcal{S}$ will change the
state $u$ of the problem solver from one step to the next (such that proof techniques can reason about the
effects of any $s_i$). Other axioms encode knowledge about arithmetic etc. (such that proof techniques can
reason about spatial and temporal resources consumed by $s_i$).

In what follows, Correctness Demonstrations will be discussed that are less general but
sometimes more convenient to implement.

3.3.2 Keeping Track Which Components of the Solver Affect Which Tasks 

Often it is possible to partition $s \in \mathcal{S}$ into components, such as individual bits of the software of a PC,
or weights of a NN. Here the $k$-th component of $s$ is denoted $s^k$. For each $k$ ($k = 1, 2, \ldots$) a variable
list $L^k = (T^k_1, T^k_2, \ldots)$ is introduced. Its initial value before the start of PowerPlay is $L^k_0$, an empty
list. Whenever $p_i$ found $s_i$ and $T_i$ at the end of Correctness Demonstration, each $L^k$ is updated as
follows: Its new value $L^k_i$ is obtained by appending to $L^k_{i-1}$ those $T_j \notin L^k_{i-1}$ ($j = 1, \ldots, i$) whose current
(possibly revised) solutions now need $s^k$ at least once during the solution-computing process, and deleting
those $T_j$ whose current solutions do not use $s^k$ any more.

PowerPlay's Correctness Demonstration thus has to test only tasks in the union of all $L^k_i$ whose
components $s^k$ were changed. That is, if the most recent task does not require changes of many components of $s$, and if the changed bits
do not affect many previous tasks, then Correctness Demonstration may be very efficient.
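A minimal sketch of this bookkeeping is given below. It makes simplifying assumptions: the lists $L^k$ are stored as Python sets of task indices rather than ordered lists, they are rebuilt from a usage table instead of being updated by appending and deleting, and the names `build_lists` and `tasks_to_retest` are hypothetical.

```python
def build_lists(usage):
    """usage maps a task index j to the set of component indices k that the
    current solution of T_j needs. Returns, for each k, the set L^k of tasks using s^k."""
    lists = {}
    for task, components in usage.items():
        for k in components:
            lists.setdefault(k, set()).add(task)
    return lists

def tasks_to_retest(lists, changed_components):
    """Union of L^k over the changed components: the only previously
    solved tasks that a modification of those components can affect."""
    retest = set()
    for k in changed_components:
        retest |= lists.get(k, set())
    return retest

# Task 1 uses components 0 and 1, task 2 uses component 1, task 3 uses component 2.
lists = build_lists({1: {0, 1}, 2: {1}, 3: {2}})
print(tasks_to_retest(lists, {2}))   # changing component 2 only touches task 3
print(tasks_to_retest(lists, {1}))   # changing component 1 touches tasks 1 and 2
```

A modification that touches only rarely used components therefore leads to a short re-test list, which is what makes such modifications cheap to validate.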

Since every new task added to the repertoire is essentially defined by the time required to invent it, to 
solve it, and to show that no previous tasks became unsolvable in the process, PowerPlay is generally 
"motivated" to invent tasks whose validity check does not require too much computational effort. That is, 
PowerPlay will often find $p_i$ that generate $s_{i-1}$-modifications that don't affect too many previous tasks,
thus decomposing at least part of the spaces of tasks and their solutions into more or less independent
regions, realizing divide and conquer strategies as by-products. Compare a recent experimental analysis of
this effect [53, 52].

3.3.3 Advantages of Prefix Code-Based Problem Solvers 

Let us restrict $\mathcal{P}$ such that tested $p \in \mathcal{P}$ cannot change any components of $s_{i-1}$ during Solver
Modification, but can create a new $s_i$ only by adding new components to $s_{i-1}$. (This means freezing all
used components of any $s_k$ once $T_k$ is found.) By restricting $\mathcal{S}$ to self-delimiting prefix codes like those
generated by the Optimal Ordered Problem Solver (OOPS) [41], one can now profit from a sometimes
particularly efficient type of Correctness Demonstration, ensuring that differences between $s_i$ and
$s_{i-1}$ cannot affect solutions to $T_{<i}$ under certain conditions. More precisely, to obtain $s_i$, half the search
time is spent on trying to process $T_i$ first by $s_{i-1}$, extending or prolonging $s_{i-1}$ only when the ongoing
computation requests to add new components through special instructions [41]; then Correctness
Demonstration has less to do as the set $T_{<i}$ is guaranteed to remain solvable, by induction. The other
half of the time is spent on processing $T_i$ by a new sub-program with new components $s'_i$, a part of $s_i$ but
not of $s_{i-1}$, where $s'_i$ may read $s_{i-1}$ or invoke parts of $s_{i-1}$ as sub-programs to solve $T_{<i}$; only then
Correctness Demonstration has to test $s_i$ not only on $T_i$ but also on $T_{<i}$ (see [41] for details).

A simple but not very general way of doing something similar is to interleave Task Invention,
Solver Modification, and Correctness Demonstration as follows: restrict all $p \in \mathcal{P}$ such that they
must define $I_i := i$ as the unique task identifier $I_i$ for $T_i$ (see Section 3.1.2); restrict all $s \in \mathcal{S}$ such that the
input of $I_i = i$ automatically invokes sub-program $s'_i$, a part of $s_i$ but not of $s_{i-1}$ (although $s'_i$ may read $s_{i-1}$
or invoke parts of $s_{i-1}$ as sub-programs to solve $T_i$). Restrict $J_i$ to a subset of acceptable computational
outcomes (Section 3.1.2). Run $s_i$ until it halts and has computed a novel output acceptable by $J_i$ that
is different from all outputs computed by the (halting) solutions to $T_{<i}$; this novel output becomes $T_i$'s
goal. By induction over $i$, since all previously used components of $s_{i-1}$ remain unmodified, the set $T_{<i}$ is
guaranteed to remain solvable, no matter $s'_i$. That is, Correctness Demonstration on previous tasks
becomes trivial. However, in this simple setup there is no immediate generalization across tasks like in
OOPS [41] and the previous paragraph: the trivial task identifier $i$ will always first invoke some $s'_i$ different
from all $s'_k$ ($k \neq i$), instead of allowing for solving a new task solely by previously found code.
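The simple interleaved scheme can be sketched as an append-only solver whose task identifier $i$ dispatches to the $i$-th sub-program. The class `AppendOnlySolver` and the function `try_extend` are hypothetical names, sub-programs are modelled as Python callables that receive the solver so they can invoke earlier sub-programs, and the acceptance test by $J_i$ is reduced to "any output is acceptable".

```python
class AppendOnlySolver:
    """Solver whose components are frozen once added; task identifier i invokes s'_i."""

    def __init__(self):
        self._subprograms = []

    def add(self, subprogram):
        self._subprograms.append(subprogram)
        return len(self._subprograms)          # new task identifier I_i := i

    def solve(self, task_id):
        return self._subprograms[task_id - 1](self)

def try_extend(solver, subprogram, previous_outputs):
    """Accept the sub-program only if its output differs from all earlier outputs.
    Earlier sub-programs are never modified, so earlier tasks need no re-testing."""
    output = subprogram(solver)
    if output in previous_outputs:
        return None                            # not novel, candidate rejected
    previous_outputs.add(output)
    return solver.add(subprogram)

solver, outputs = AppendOnlySolver(), set()
first = try_extend(solver, lambda s: 1, outputs)
duplicate = try_extend(solver, lambda s: 1, outputs)
second = try_extend(solver, lambda s: s.solve(1) + 1, outputs)   # re-uses s'_1
print(first, duplicate, second, solver.solve(2))
```

As in the text, a new task is always handled by a fresh sub-program here, so nothing is gained from earlier code unless the new sub-program explicitly calls it.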

4 Implementations of PowerPlay 

PowerPlay is a general framework that allows for plugging in many different search and learning
algorithms. The present section will discuss some of them.

4.1 Implementation Based on Optimal Ordered Problem Solver OOPS 



Alg. 4.1 Implementing PowerPlay with Procedure OOPS [41]

(see text for details) - initialize $s_0$ and $u$ (internal dynamic storage for $s \in \mathcal{S}$) and $U$ (internal dynamic
storage for $p \in \mathcal{P}$), where each possible $p$ is a sequence of subprograms $p', p'', p'''$.
for i := 1, 2, . . . do
    set variable time limit $t_{lim} := 1$;
    let the variable set H be empty;
    set Boolean variable DONE := FALSE
    repeat
        if H is empty then
            set $t_{lim} := 2 t_{lim}$; $H := \{p \in \mathcal{P} : P(p) t_{lim} \ge 1\}$
        else
            choose and remove some p from H
            while not DONE and less than $P(p) t_{lim}$ time was spent on p do
                execute the next time step of the following computation:
                1. Let $p'$ compute some task $T \in \mathcal{T}$ and halt.
                2. Let $p''$ compute $q \in \mathcal{S}$ by modifying a copy of $s_{i-1}$, and halt.
                3. Let $p'''$ try to show that q but not $s_{i-1}$ can solve $T_1, T_2, \ldots, T_{i-1}, T$.
                   If $p'''$ was successful set DONE := TRUE.
            end while
            Undo all modifications of u and U due to p. This does not cost more time than executing p in the
            while loop above [41].
        end if
    until DONE
    set $p_i := p$; $T_i := T$; $s_i := q$;
    add a unique encoding of the 5-tuple $(i, p_i, s_i, T_i, Trace_i)$ to read-only storage readable by programs
    to be tested in the future.
end for



The $i$-th problem is to find a program $p_i \in \mathcal{P}$ that creates $s_i$ and $T_i$ and demonstrates that $s_i$ but not
$s_{i-1}$ can solve $T_1, T_2, \ldots, T_i$. This yields a perfectly ordered problem sequence for a variant of the Optimal
Ordered Problem Solver OOPS [41] (Algorithm 4.1).

While a candidate program $p \in \mathcal{P}$ is executed, at any given discrete time step $t = 1, 2, \ldots$, its internal
state or dynamical storage $U$ at time $t$ is denoted $U(t) \in B^*$ (not to be confused with the solver's internal
state $u(t)$ of Section 3.1.2). Its initial default value is $U(0)$. E.g., $U(t)$ could encode the current contents
of the internal tape of a TM (to be modified by $p$), or of certain cells in the dynamic storage area of a PC.

Once $p_i$ is found, $p_i, s_i, T_i, Trace_i$ (if applicable; see Section 3.1.2) will be saved in unmodifiable
read-only storage, possibly together with other data observed during the search so far. This may greatly facilitate
the search for $p_k$, $k > i$, since $p_k$ may contain instructions for addressing and reading $p_j, s_j, T_j, Trace_j$
($j = 1, \ldots, k-1$) and for copying the read code into modifiable storage $U$, where $p_k$ may further edit the code,
and execute the result, which may be a useful subprogram [41].
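The read-only storage of frozen tuples can be pictured with a small Python sketch. The names `Record` and `Archive` and the concrete field types are illustrative assumptions, not part of OOPS or PowerPlay; the point is only that earlier results can be read and copied by later programs but never overwritten.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, Mapping, Tuple


@dataclass(frozen=True)
class Record:
    """One frozen 5-tuple (i, p_i, s_i, T_i, Trace_i); field types are illustrative."""
    i: int
    program: str
    solver: Tuple[float, ...]
    task: str
    trace: Tuple[str, ...]


class Archive:
    """Append-only store: records can be added once and read, never overwritten."""

    def __init__(self) -> None:
        self._records: Dict[int, Record] = {}

    def freeze(self, record: Record) -> None:
        if record.i in self._records:
            raise ValueError(f"record {record.i} is already frozen")
        self._records[record.i] = record

    def view(self) -> Mapping[int, Record]:
        # Read-only view handed to later candidate programs p_k (k > i).
        return MappingProxyType(self._records)


archive = Archive()
archive.freeze(Record(1, "p1", (0.5, -0.2), "T1", ("obs", "act")))
stored = archive.view()

# A later program may copy stored code into its own modifiable storage U
# and edit the copy; the archived original stays untouched.
editable_copy = list(stored[1].solver)
editable_copy[0] = 0.9
```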

Define a probability distribution $P(p)$ on $\mathcal{P}$ to represent the searcher's initial bias (more likely programs
$p$ will be tested earlier [17]). $P$ could be based on program length, e.g., $P(p) = 2^{-l(p)}$ (where $l(p)$ denotes
the length of $p$), or on a probabilistic syntax diagram [41, 40]. See Algorithm 4.1.

OOPS keeps doubling the time limit until there is sufficient runtime for a sufficiently likely program
to compute a novel, previously unsolvable task, plus its solver, which provably does not forget previous
solutions. OOPS allocates time to programs according to an asymptotically optimal universal search
method [17] for problems with easily verifiable solutions, that is, solutions whose validity can be quickly
tested. Given some problem class, if some unknown optimal program $p$ requires $f(k)$ steps to solve a
problem instance of size $k$ and demonstrate the correctness of the result, then this search method will need
at most $O(f(k)/P(p)) = O(f(k))$ steps; the constant factor $1/P(p)$ may be large but does not depend
on $k$. Since OOPS may re-use previously generated solutions and solution-computing programs, however,
it may be possible to greatly reduce the constant factor associated with plain universal search [41].
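The time allocation can be made concrete with a toy simulation in Python. This is a sketch of plain universal search with a doubling time limit, not of OOPS itself: the `Candidate` record, the finite candidate list, the initial limit of 1, and the fact that each program is restarted from scratch in every phase (no storage undo, no code reuse) are simplifying assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    name: str
    prior: float  # P(p); assumed to sum to at most 1 over all candidates
    steps: int    # simulated number of time steps until the program halts
    output: int   # what the program outputs once it halts


def universal_search(
    candidates: List[Candidate],
    is_solution: Callable[[int], bool],
    max_phases: int = 40,
) -> Optional[Tuple[str, int]]:
    """Return (name of first verified program, total steps spent), or None."""
    t_lim = 1.0
    spent = 0
    for _ in range(max_phases):
        t_lim *= 2.0  # keep doubling the time limit
        for p in candidates:
            budget = int(p.prior * t_lim)  # time share proportional to P(p)
            if budget < 1:
                continue  # p is not yet likely enough for this phase
            used = min(budget, p.steps)
            spent += used
            if used == p.steps and is_solution(p.output):
                return p.name, spent
    return None


candidates = [
    Candidate("fast_wrong", prior=0.5, steps=1, output=0),
    Candidate("slow_right", prior=0.25, steps=10, output=42),
]
result = universal_search(candidates, is_solution=lambda out: out == 42)
```

In this run the solving program needs $f = 10$ steps and has $P(p) = 0.25$, so $f/P(p) = 40$; the total simulated effort stays within a small constant factor of that value, independent of how the instance size would scale $f$.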

The big difference to previous implementations of OOPS is that PowerPlay has the additional free- 
dom to define its own tasks. As always, every new task added to the repertoire is essentially defined by the 
time required to invent it, to solve it, and to demonstrate that no previously learned skills got lost. 

4.1.1 Building on Existing OOPS Source Code 

Existing OOPS source code [40] uses a FORTH-like universal programming language to define $\mathcal{P}$. It
already contains a framework for testing new code on previously solved tasks, and for efficiently undoing
all $U$-modifications of each tested program. The source code will require a few changes to implement the
additional task search described above.

4.1.2 Alternative Problem Solvers Based on Recurrent Neural Networks 

Recurrent NNs (RNNs, e.g., [59, 61, 27, 32, 12]) are general computers that allow for both sequential
and parallel computations, unlike the strictly sequential FORTH-like language of Section 4.1.1. They can
compute any function computable by a standard PC [29]. The original report [46] used a fully connected
RNN called RNN1 to define $\mathcal{S}$, where $w^{lk}$ is the real-valued weight on the directed connection between the
$l$-th and $k$-th neuron. To program RNN1 means to set the weight matrix $s = \langle w^{lk} \rangle$. Given enough neurons
with appropriate activation functions and an appropriate $\langle w^{lk} \rangle$, Algorithm 4.1 can be used to train $s$. $\mathcal{P}$
may itself be the set of weight matrices of a separate RNN called RNN2, computing tasks for RNN1, and
modifications of RNN1, using techniques for network-modifying networks as described in previous work
[33, 35, 34].

In first experiments [53, 52], a particularly suited NN called a self-delimiting NN or SLIM NN [47] is
used. During program execution or activation spreading in the SLIM NN, lists are used to trace only those
neurons and connections used at least once. This also allows for efficient resets of large NNs which may use
only a small fraction of their weights per task. Unlike standard RNNs, SLIM NNs are easily combined with
techniques of asymptotically optimal program search [17, 48, 39, 41] (Section 4.1). To address overfitting,
instead of depending on pre-wired regularizers and hyper-parameters [3], SLIM NNs can in principle learn
to select by themselves their own runtime and their own numbers of free parameters, becoming fast and
slim when necessary. Efficient SLIM NN learning algorithms (LAs) track which weights are used for which
tasks (Section 3.3.2), to greatly speed up performance evaluations in response to limited weight changes.
LAs may penalize the task-specific total length of connections used by SLIM NNs implemented on the
3-dimensional brain-like multi-processor hardware to be expected in the future. This encourages SLIM NNs
to solve many subtasks by subsets of neurons that are physically close [47].
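The bookkeeping idea behind fast re-evaluation can be sketched in a few lines of Python. This is a hypothetical illustration, not the SLIM NN learning algorithm: the function `tasks_to_retest` and the representation of a connection as a `(source, target)` pair are assumptions made here. Given the set of connections each stored task actually used, only tasks touching a changed connection need to be re-validated.

```python
from typing import Dict, Iterable, Set, Tuple

Connection = Tuple[int, int]  # (source neuron l, target neuron k)


def tasks_to_retest(
    used_by_task: Dict[str, Set[Connection]], changed: Iterable[Connection]
) -> Set[str]:
    """Return the tasks whose recorded computation used a changed connection."""
    changed_set = set(changed)
    return {task for task, used in used_by_task.items() if used & changed_set}


# Connections traced during the solution of each previously learned task.
used_by_task = {
    "T1": {(0, 1), (1, 2)},
    "T2": {(3, 4)},
    "T3": {(1, 2), (2, 5)},
}
affected = tasks_to_retest(used_by_task, changed=[(1, 2)])
```

Here a change to the single weight on connection `(1, 2)` forces re-validation of `T1` and `T3` only, while `T2` is guaranteed to be unaffected.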

4.2 Adapting the Probability Distribution on Programs 

A straightforward extension of the above works as follows: Whenever a new $p_i$ is found, $P$ is updated to
make either only $p_i$ or all $p_1, p_2, \ldots, p_i$ more likely. Simple ways of doing this are described in previous
work [48]. This may be justified to the extent that future successful programs turn out to be similar to
previous ones.
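One simple way to realize such an update is sketched below in Python. It is an assumption of this sketch, not the method of [48]: programs are treated as sequences of tokens drawn independently from one categorical distribution, and the hypothetical function `reinforce` shifts a fraction `rate` of the probability mass toward the token frequencies of the program just found.

```python
from collections import Counter
from typing import Dict, Sequence


def program_probability(token_probs: Dict[str, float], program: Sequence[str]) -> float:
    """P(p) as the product of independent token probabilities (a modeling assumption)."""
    prob = 1.0
    for token in program:
        prob *= token_probs[token]
    return prob


def reinforce(
    token_probs: Dict[str, float], program: Sequence[str], rate: float = 0.2
) -> Dict[str, float]:
    """Mix the old distribution with the token frequencies of a successful program."""
    if not program:
        return dict(token_probs)
    counts = Counter(program)
    n = len(program)
    return {
        token: (1.0 - rate) * p + rate * counts[token] / n
        for token, p in token_probs.items()
    }


prior = {"push": 0.5, "add": 0.25, "jump": 0.25}
found = ["push", "push", "add"]  # tokens of the newly found program p_i
posterior = reinforce(prior, found)
```

The updated distribution still sums to one, makes the found program (and programs built from the same tokens) more likely, and lowers the probability of unused tokens such as `jump`.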

4.3 Implementation Based on Stochastic or Evolutionary Search 

A possibly simpler but less general approach is to use an evolutionary algorithm to produce an $s$-modifying
and task-generating program $p$ as requested by PowerPlay, according to Algorithm 4.3, which refers to
the recurrent net problem solver of Section 4.1.2.

Alg. 4.3: PowerPlay for RNNs Using Stochastic or Evolutionary Search

Randomly initialize RNN1's variable weight matrix $\langle w^{lk} \rangle$ and use the result as $s_0$ (see Section 4.1.2)
for $i := 1, 2, \ldots$ do
    set Boolean variable DONE=FALSE
    repeat
        use a black box optimization algorithm BBOA (many are possible [24, 10, 60, 49]) with adaptive
        parameter vector $\theta$ to create some $T \in \mathcal{T}$ (to define the task input to RNN1; see Section 3.1) and a
        modification of $s_{i-1}$, the current $\langle w^{lk} \rangle$ of RNN1, thus obtaining a new candidate $q \in \mathcal{S}$
        if $q$ but not $s_{i-1}$ can solve $T$ and all $T_k$ ($k < i$) (see Sections 3.3, 3.3.2) then
            set DONE=TRUE
        end if
    until DONE
    set $s_i := q$; $\langle w^{lk} \rangle := q$; $T_i := T$; (also store $Trace_i$ if applicable, see Section 3.1.2). Use the
    information stored so far to adapt the parameters $\theta$ of the BBOA, e.g., by gradient-based search [60, 49],
    or according to the principles of evolutionary computation [24, 10, 60].
end for
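The control flow of this loop can be exercised on a deliberately tiny stand-in problem. The Python sketch below makes strong simplifying assumptions that are not in the paper: the solver is a list of bits instead of an RNN weight matrix, a task is a pair (cell index, required bit), and the BBOA with adaptive parameters $\theta$ is replaced by pure random search without any adaptation. Only the acceptance test (the new solver solves the new task and all old ones, while the old solver does not solve the new task) mirrors the algorithm.

```python
import random
from typing import List, Tuple

Task = Tuple[int, int]  # (cell index given as task input, required output bit)


def solves(solver: List[int], task: Task) -> bool:
    index, bit = task
    return solver[index] == bit


def powerplay_random_search(
    n_cells: int, n_tasks: int, seed: int = 0
) -> Tuple[List[int], List[Task]]:
    if n_tasks > n_cells:
        raise ValueError("this toy solver can hold at most n_cells tasks")
    rng = random.Random(seed)
    solver = [rng.randint(0, 1) for _ in range(n_cells)]  # random s_0
    tasks: List[Task] = []
    for _ in range(n_tasks):
        while True:  # repeat ... until DONE
            task = (rng.randrange(n_cells), rng.randint(0, 1))  # candidate T
            q = list(solver)                                    # copy of s_{i-1}
            q[rng.randrange(n_cells)] ^= 1                      # random modification
            if (
                solves(q, task)
                and not solves(solver, task)
                and all(solves(q, old) for old in tasks)
            ):
                break
        solver = q          # s_i := q
        tasks.append(task)  # T_i := T
    return solver, tasks


final_solver, invented = powerplay_random_search(n_cells=8, n_tasks=5)
```

In this toy setting every accepted task necessarily claims a cell not used by any earlier task, so the repertoire is bounded by the solver's storage size; richer architectures are needed before code reuse across tasks can appear.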



5 Outgrowing Trivial Tasks - Compressing Previous Solutions 

What prevents PowerPlay from inventing trivial tasks forever by extreme modularization, simply
allocating a previously unused solver part to each new task, which thus becomes rather quickly verifiable, as its
solution does not affect solutions to previous tasks (Section 3.3.3)? On realistic but general architectures
such as PCs and RNNs, at least once the upper storage size limit of $s$ is reached, PowerPlay will start
"compressing" previous solutions, making $s$ generalize in the sense that the same relatively short piece of
code (some part of $s$) helps to solve different tasks.

With many computational architectures, this type of compression will start much earlier though, be- 
cause new tasks solvable by partial reuse of earlier discovered code will often be easier to find than new 
tasks solvable by previously unused parts of s. This also holds for growing architectures with potentially 
unlimited storage space. 

Compare also PowerPlay Variant II of Section 7.1, whose tasks may explicitly require improving the
average time and space complexity of previous solutions by some minimal value.

In general, however, over time the system will find it more and more difficult to invent novel tasks 
without forgetting previous solutions, a bit like humans find it harder and harder to learn truly novel behav- 
iors once they are leaving behind the initial rapid exploration phase typical for babies. Experiments with 
various problem solver architectures (e.g., [53, 52]) are needed to analyze such effects in detail.

6 Adding External Tasks



The growing repertoire of the problem solver may facilitate learning of solutions to externally posed tasks.
For example, one may modify PowerPlay such that for certain $i$, $T_i$ is defined externally, instead of being
invented by the system itself. In general, the resulting $s_i$ will contain an externally inserted bias in the form
of code that will make some future self-generated tasks easier to find than others. It should be possible to
push the system in a human-understandable or otherwise useful direction by regularly inserting appropriate
external goals. See Algorithm 7.1.

Another way of exploiting the growing repertoire is to simply copy $s_i$ for some $i$ and use it as a starting
point for a search for a solution to an externally posed task $T$, without insisting that the modified $s_i$ also
can solve $T_1, T_2, \ldots, T_i$. This may be much faster than trying to solve $T$ from scratch, to the extent the
solutions to self-generated tasks reflect general knowledge (code) re-usable for $T$.

In general, however, it will be possible to design external tasks whose solutions do not profit from those 
of self-generated tasks — the latter even may turn out to slow down the search. 

On the other hand, in the real world the benefits of curious exploration seem obvious. One should 
analyze theoretically and experimentally under which conditions the creation of self-generated tasks can 
accelerate the solution to externally generated tasks; see [30, 54, 37, 38, 19, 4, 28, 62] for previous simple
experimental studies in this vein.

6.1 Self-Reference Through Novel Task Search as an External Task

PowerPlay's $i$-th goal is to find a $p_i \in \mathcal{P}$ that creates $T_i$ and $s_i$ (a modification of $s_{i-1}$) and shows that
$s_i$ but not $s_{i-1}$ can solve $T_{\le i}$. As $s$ itself is becoming a more and more general problem solver, $s$ may
help in many ways to achieve such goals in self-referential fashion. For example, the old solver $s_{i-1}$ may
be able to read a unique formal description (provided by $p_i$) of PowerPlay's $i$-th goal, viewing it as an
external task, and produce an output unambiguously describing a candidate for $(T_i, s_i)$. If $s$ has a theorem
prover component (Section 3.3.1), $s_{i-1}$ might even output a full proof of $(T_i, s_i)$'s validity; alternatively $p_i$
could just use the possibly suboptimal suggestions of $s_{i-1}$ to narrow down and speed up the search. This is one of
the reasons why Section 2 already mentioned that programs $p \in \mathcal{P}$ should contain instructions for reading
(and running) the code of the present problem solver.

7 Softening Task Acceptance Criteria of PowerPlay 

The PowerPlay variants above insist that $s$ may not solve new tasks at the expense of forgetting to
solve any previously solved task within its previously established time and space bounds. For example,
let us consider the sequential decision-making tasks from Section 3.1.2. Suppose the problem solver can
already solve task $T_k = (I_k, J_k, t_k, n_k) \in \mathcal{I} \times \mathcal{J} \times \mathbb{N} \times \mathbb{N}$. A very similar but admissible new task
$T_i = (I_k, J_k, t_i, n_k)$, ($i > k$), would be to solve $T_k$ substantially faster: $t_i < t_k - \epsilon$, as long as $T_i$ is not
already solvable by $s_{i-1}$, and no solution to some $T_l$ ($l < i$) is forgotten in the process.

Here I discuss variants of PowerPlay that soften the acceptance criteria for new tasks in various
ways, for example, by allowing some of the computations of solutions to previous non-external (Section 6)
tasks to slow down by a certain amount of time, provided the sum of their runtimes does not increase. This
also permits the system to invent new previously unsolved tasks at the expense of slightly increasing time
bounds for certain already solved non-external tasks, but without decreasing the average performance on
the latter. Of course, PowerPlay has to be modified accordingly, updating average runtime bounds when
necessary.

Alternatively, one may allow for trading off space and time constraints in reasonable ways, e.g., in the 
style of asymptotically optimal Universal Search [17], which essentially trades one bit of additional space 
complexity for a runtime speedup factor of 2.

7.1 PowerPlay Variant II: Explicitly Penalizing Time and Space Complexity



Let us remove time and space bounds from the task definitions of Section 3.1.2, since the modified cost-based
PowerPlay framework below (Algorithm 7.1) will handle computational costs (such as time and
space complexity of solutions) more directly. In the present section, $T_i$ encodes a tuple $(I_i, J_i) \in \mathcal{I} \times \mathcal{J}$
with interpretation: $s_i$ must first read $I_i$ and then interact with an environment through a sequence of
perceptions and actions, to achieve some computable goal defined by $J_i$ within a certain maximal time
interval $t_{max}$ (a positive constant). Let $t'_s(T)$ be $t_{max}$ if $s$ cannot solve task $T$, otherwise it is the time
needed to solve $T$ by $s$. Let $l'_s(T)$ be the positive constant $l_{max}$ if $s$ cannot solve $T$, otherwise it is the
number of components of $s$ needed to solve task $T$ by $s$. The non-negative real-valued reward $r(T)$ for
solving $T$ is a positive constant $r_{new}$ for self-defined previously unsolvable $T$, or user-defined if $T$ is
an external task solved by $s$ (Section 6). The real-valued cost $Cost(s, TSET)$ of solving all tasks in
a task set $TSET$ through $s$ is a real-valued function of: all $l'_s(T)$, $t'_s(T)$ (for all $T \in TSET$), $L(s)$,
and $\sum_{T \in TSET} r(T)$. For example, the cost function $Cost(s, TSET) = L(s) + \alpha \sum_{T \in TSET} [t'_s(T) - r(T)]$
encourages compact and fast solvers solving many different tasks with the same components of $s$,
where the real-valued positive parameter $\alpha$ weighs space costs against time costs, and $r_{new}$ should exceed
$t_{max}$ to encourage solutions of novel self-generated tasks, whose cost contributions should be below zero
(alternative cost definitions could also take into account energy consumption etc.).
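The example cost function is easy to evaluate numerically. The following Python sketch is only an illustration under stated assumptions: the function name `cost`, the dictionary-based task bookkeeping, the numeric values, and the convention that an unsolved task earns no reward are choices made here, not taken from the paper.

```python
from typing import Dict, Optional


def cost(
    solver_size: float,
    solve_times: Dict[str, Optional[float]],
    rewards: Dict[str, float],
    alpha: float,
    t_max: float,
) -> float:
    """Cost(s, TSET) = L(s) + alpha * sum over T in TSET of [t'_s(T) - r(T)].

    solve_times[T] is the time s needs for T, or None if s cannot solve T,
    in which case t'_s(T) = t_max. Assumption of this sketch: an unsolved
    task earns no reward.
    """
    total = 0.0
    for task, time_needed in solve_times.items():
        if time_needed is None:
            total += t_max
        else:
            total += time_needed - rewards[task]
    return solver_size + alpha * total


T_MAX, R_NEW, ALPHA = 15.0, 20.0, 0.5  # r_new exceeds t_max, as required
rewards = {"T1": R_NEW, "T2": R_NEW}

# Old solver: size 10, solves T1 in 3 steps, cannot solve T2.
old_cost = cost(10.0, {"T1": 3.0, "T2": None}, rewards, ALPHA, T_MAX)
# New solver: slightly larger, solves T1 in 3 steps and T2 in 5 steps.
new_cost = cost(11.0, {"T1": 3.0, "T2": 5.0}, rewards, ALPHA, T_MAX)
savings = old_cost - new_cost
```

With these numbers the old cost is 9.0 and the new cost is -5.0, so solving the additional task saves 14.0 cost units even though the solver grew by one unit.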

Let us keep an analogue of the remaining notation of Section 3.1.2, such as $u_i(t), x_i(t), r_i(t), y_i(t), Trace_i$,
$J_i(Trace_i)$. As always, if the environment is unknown and possibly changing over time, to test the
performance of a new solver $s$ on a previous task $T_k$, only $Trace_k$ is necessary; see Section 3.1.2. As always,
let $T_{\le i}$ denote the set containing all tasks $T_1, \ldots, T_i$ (note that if $T_i = T_k$ for some $k < i$ then it will appear
only once in $T_{\le i}$), and let $\epsilon > 0$ again define what's acceptable progress:



Alg. 7.1: PowerPlay Framework (Variant II) Explicitly Handling Costs of Solving Tasks

Initialize $s_0$ in some way
for $i := 1, 2, \ldots$ do
    Create new global variables $T_i \in \mathcal{T}$, $s_i \in \mathcal{S}$, $p_i \in \mathcal{P}$, $c_i, c^*_i \in \mathbb{R}$ (to be fixed by the end of repeat)
    repeat
        Let a search algorithm (Section 4) set $p_i$, a new candidate program. Give $p_i$ limited time to do:
        * Task Invention: Unless the user specifies $T_i$ (Section 6), let $p_i$ set $T_i$.
        * Solver Modification: Let $p_i$ set $s_i$ by computing a modification of $s_{i-1}$ (Section 3.2).
        * Correctness Demonstration: Let $p_i$ compute $c_i := Cost(s_i, T_{\le i})$ and
          $c^*_i := Cost(s_{i-1}, T_{\le i})$
    until $c^*_i - c_i > \epsilon$ (minimal savings of costs such as time/space/etc on all tasks so far)
    Freeze/store forever $p_i, T_i, s_i, c_i, c^*_i$
end for



By Algorithm 7.1, $s_i$ may forget certain abilities of $s_{i-1}$, provided that the overall performance as
measured by $Cost(s_i, T_{\le i})$ has improved, either because a new task became solvable, or previous tasks
became solvable more efficiently.

Following Section 3.3, CORRECTNESS DEMONSTRATION can often be facilitated, for example, by
tracking which components of $s_i$ are used for solving which tasks (Section 3.3.2).

To further refine this approach, consider that in phase $i$, the list $L^k_i$ (defined in Section 3.3.2) contains
all previously learned tasks whose solutions depend on $s^k$. This can be used to determine the current value
$Val(s^k)$ of some component $s^k$ of $s$: $Val(s^k) = -\sum_{T \in L^k_i} Cost(s_i, T_{\le i})$. It is a simple exercise to invent
PowerPlay variants that do not forget valuable components as easily as less valuable ones.

The implementations of Sections 4.1 and 4.3 are easily adapted to the cost-based PowerPlay
framework. Compare separate papers [53, 52].

7.2 Probabilistic PowerPlay Variants



Section 3.1.2 pointed out that in partially observable and/or non-stationary unknown environments
CORRECTNESS DEMONSTRATION must use $Trace_k$ to check whether a new $s_i$ still knows how to solve an
earlier task $T_k$ ($k < i$). A less strict variant of PowerPlay, however, will simply make certain
assumptions about the probabilistic nature of the environment and the repeatability of trials, assuming that a limited
fixed number of interactions with the real world are sufficient to estimate the costs $c^*_i, c_i$ in Algorithm 7.1.

Another probabilistic way of softening PowerPlay is to add new tasks without proof that $s$ won't
forget solutions to previous tasks, provided CORRECTNESS DEMONSTRATION can at least show that the
probability of forgetting any previous solution is below some real-valued positive constant threshold.
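One conceivable way to obtain such a probabilistic guarantee is a standard concentration bound over repeated trials. The Python sketch below is an assumption of this edit, not a procedure from the paper: it presumes independent, identically distributed re-tests of previous tasks and uses the one-sided Hoeffding inequality; the function names and the default `delta` are hypothetical.

```python
import math


def forgetting_upper_bound(failures: int, trials: int, delta: float) -> float:
    """One-sided Hoeffding bound on the true failure rate.

    With probability at least 1 - delta (assuming i.i.d. trials), the true
    probability of failing a previous task is at most the returned value.
    """
    if trials <= 0:
        raise ValueError("need at least one trial")
    return failures / trials + math.sqrt(math.log(1.0 / delta) / (2.0 * trials))


def accept_new_task(
    failures: int, trials: int, threshold: float, delta: float = 0.05
) -> bool:
    """Accept only if the bound on the forgetting probability is below threshold."""
    return forgetting_upper_bound(failures, trials, delta) < threshold


few_trials = accept_new_task(failures=0, trials=100, threshold=0.05)
many_trials = accept_new_task(failures=0, trials=1000, threshold=0.05)
some_failures = accept_new_task(failures=30, trials=1000, threshold=0.05)
```

Zero observed failures in 100 re-tests are not enough to certify a 5% threshold at this confidence level, whereas zero failures in 1000 re-tests are; 30 failures in 1000 re-tests lead to rejection.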

8 First Illustrative Experiments 

First experiments are reported in separate papers [53, 52] (some experiments were also briefly mentioned
in the original report [46]). Standard NNs as well as SLIM RNNs [47] are used as computational problem
solving architectures. The weights of SLIM RNNs can encode essentially arbitrary computable tasks as
well as arbitrary, self-delimiting, halting or non-halting programs solving those tasks. These programs may
affect both environment (through effectors) and internal states encoding abstractions of event sequences.
In open-ended fashion, the PowerPlay-driven NNs learn to become increasingly general solvers of
self-invented problems, continually adding new problem solving procedures to the growing repertoire,
sometimes compressing/speeding up previous skills, sometimes preferring to invent new tasks and
corresponding skills. The NNs exhibit interesting developmental stages. It is also shown how a PowerPlay-driven
SLIM NN automatically self-modularizes [52], frequently re-using code for previously invented skills,
always trying to invent novel tasks that can be quickly validated because they do not require too many weight
changes affecting too many previous tasks.

9 Previous Relevant Work 

Here I discuss related research, in particular, why the present work is of interest despite the recent advent
of theoretically optimal universal problem solvers (Section 9.1), and how it can be viewed as a greedy but
feasible and sound implementation of the formal theory of creativity (Section 9.3).

9.1 Existing Theoretically Optimal Universal Problem Solvers 

The new millennium brought universal problem solvers that are theoretically optimal in a certain sense.
The fully self-referential [9] Gödel machine [43, 44] may interact with some initially unknown, partially
observable environment to maximize future expected utility or reward by solving arbitrary user-defined
computational tasks. Its initial algorithm is not hardwired; it can completely rewrite itself without essential
limits apart from the limits of computability, but only if a proof searcher embedded within the initial
algorithm can first prove that the rewrite is useful, according to the formalized utility function taking
into account the limited computational resources. Self-rewrites due to this approach can be shown to be
globally optimal, relative to Gödel's well-known fundamental restrictions of provability [9]. To make sure
the Gödel machine is at least asymptotically optimal even before the first self-rewrite, one may initialize it
by Hutter's non-self-referential but asymptotically fastest algorithm for all well-defined problems Hsearch
[13], which uses a hardwired brute force proof searcher and ignores the costs of proof search. Assuming
discrete input/output domains $X/Y \subset B^*$, a formal problem specification $f : X \to Y$ (say, a functional
description of how integers are decomposed into their prime factors), and a particular $x \in X$ (say, an
integer to be factorized), Hsearch orders all proofs of an appropriate axiomatic system by size to find
programs $q$ that for all $z \in X$ provably compute $f(z)$ within time bound $t_q(z)$. Simultaneously it spends
most of its time on executing the $q$ with the best currently proven time bound $t_q(x)$. Hsearch is as fast as
the fastest algorithm that provably computes $f(z)$ for all $z \in X$, save for a constant factor smaller than
$1 + \epsilon$ (arbitrarily small real-valued $\epsilon > 0$) and an $f$-specific but $x$-independent additive constant [13]. Given
some problem, the Gödel machine may decide to replace Hsearch by a faster method suffering less from
large constant overhead, but even if it doesn't, its performance won't be less than asymptotically optimal.
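The speed guarantee summarized above can be written as a single inequality. The following is only a restatement of the sentence in the text, not the full theorem: the symbols $t_{\mathrm{Hsearch}}(x)$ (runtime of Hsearch on input $x$) and $c_f$ (the additive constant) are names introduced here for illustration, and the precise statement with all its terms is given in [13].

```latex
t_{\mathrm{Hsearch}}(x) \;\le\; (1+\epsilon)\, t_q(x) + c_f
\qquad \text{for every } q \text{ that provably computes } f \text{ within time bound } t_q ,
```

where $\epsilon > 0$ can be chosen arbitrarily small and $c_f$ depends on $f$ but not on $x$. The size of $c_f$ is what can make the method infeasible in practice despite its asymptotic optimality.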

Why doesn't everybody use such universal problem solvers for all computational real-world problems?
Because most real-world problems are so small that the ominous constant slowdowns (potentially relevant 
at least before the first self-rewrite) may be large enough to prevent the universal methods from being 
feasible. 

PowerPlay, on the other hand, is designed to incrementally build a practical, more and more general
problem solver that can solve numerous tasks quickly, not in the asymptotic sense, but by exploiting to 
the max its given particular search algorithm and computational architecture, with all its space and time 
limitations, including those reflected by constants ignored by the asymptotic optimality notation. 

As mentioned in Section 6, however, one must now analyze under which conditions PowerPlay's
self-generated tasks can accelerate the solution to externally generated tasks (compare previous
experimental studies of this type [30, 54, 37, 38]).

9.2 Connection to Traditional Active Learning 

Traditional active learning methods [6] such as AdaBoost [8] have a totally different set-up and purpose:
there the user provides a set of samples to be learned, then each new classifier in a series of classifiers
focuses on samples badly classified by previous classifiers. Open-ended PowerPlay, however, (1) considers
arbitrary computational problems (not necessarily classification tasks); (2) can self-invent all
computational tasks. There is no need for a pre-defined global set of tasks that each new solver tries to solve
better; instead, the task set continually grows based on which task is easy to invent and validate, given what
is already known.

9.3 Greedy Implementation of Aspects of the Formal Theory of Creativity 

The Formal Theory of Creativity [42, 45] considers agents living in initially unknown environments. At any
given time, such an agent uses a reinforcement learning (RL) method [15] to maximize not only expected
future external reward for achieving certain goals, but also intrinsic reward for improving an internal model
of the environmental responses to its actions, learning to better predict or compress the growing history of
observations influenced by its behavior, thus achieving wow-effects, actively learning skills to influence the
input stream such that it contains previously unknown but learnable algorithmic regularities. I have argued
that the theory explains essential aspects of intelligence including selective attention, curiosity, creativity,
science, art, music, humor, e.g., [42, 45]. Compare recent related work, e.g., [1, 5, 22, 20].

Like PowerPlay, such a creative agent produces a sequence of self-generated tasks and their solu- 
tions, each task still unsolvable before learning, yet becoming solvable after learning. The costs of learning 
as well as the learning progress are measured, and enter the reward function. Thus, in the absence of ex- 
ternal reward for reaching user-defined goals, at any given time the agent is motivated to invent a series of 
additional tasks that maximize future expected learning progress. 

For example, by restricting its input stream to self-generated pairs $(I, O) \in \mathcal{I} \times \mathcal{O}$ like in Section 3.1.1,
and limiting it to predict only $O$, given $I$ (instead of also trying to predict future $(I, O)$ pairs from previous
ones, which the general agent would do), there will be a motivation to actively generate a sequence of
$(I, O)$ pairs such that the $O$ are first subjectively unpredictable from their $I$ but then become predictable
with little effort, given the limitations of whatever learning algorithm is used.

Here some cons and pros of PowerPlay are listed in light of the above. Its drawbacks include: 

1. Instead of maximizing future expected reward, PowerPlay is greedy, always trying to find the 
simplest (easiest to find and validate) task to add to the repertoire, or the simplest way of improving 
the efficiency or compressibility of previous solutions, instead of looking further ahead, as a universal 
RL method [42, 45] would do. That is, PowerPlay may potentially sacrifice large long-term gains
for small short-term gains: the discovery of many easily solvable tasks may at least temporarily 
prevent it from learning to solve hard tasks. 

On general computational architectures such as RNNs (Section 4.1.2), however, PowerPlay is
expected to soon run out of easy tasks that are not yet solvable, due to the architecture's limited
capacity and its unavoidable generalization effects (many never-tried tasks will become solvable by
solutions to the few explicitly tested $T_i$). Compare Section 5.

2. The general creative agent above [42, 45] is motivated to improve performance on the entire history
of previous still unsolved tasks, while PowerPlay may discard much of this history, keeping only a 
selective list of previously solved tasks. However, as the system is interacting with its environment, 
one could store the entire continually growing history, and make sure that T always allows for 
defining the task of better compressing the history so far. 

3. PowerPlay as in Section 2 has a binary criterion for adding knowledge (was the new task solvable
without forgetting old solutions?), while the general agent [42, 45] uses a more informative
information-theoretic measure. The cost-based PowerPlay framework (Alg. 7.1) of Section 7.1,
however, offers similar, more flexible options, rewarding compression or speedup of solutions to
previously solved tasks.

On the other hand, drawbacks of previous implementations of formal creativity theory include: 

1. Some previous approximative implementations [30, 54] used traditional RL methods [15] with
theoretically unlimited look-ahead, but those are not guaranteed to work well in partially observable
and/or non-stationary environments where the reward function changes over time, and won't
necessarily generate an optimal sequence of future tasks or experiments.

2. Theoretically optimal implementations [42, 45] are currently still impractical, for reasons similar to
those discussed in Section 9.1.

Hence PowerPlay may be viewed as a greedy but feasible implementation of certain basic principles
of creativity [42, 45]. PowerPlay-based systems are continually motivated to invent new tasks solvable
by formerly unknown procedures, or to compress or speed up problem solving procedures discovered 
earlier. Unlike previous implementations, PowerPlay extracts from the lifelong experience history a 
sequence of clearly identified and separated tasks with explicitly recorded solutions. By design it cannot 
suffer from online learning problems affecting its solver's performance on previously solved problems. 

9.4 Beyond Algorithmic Zero-Sum Games [37, 38] (1997-2002)

This guaranteed robustness against forgetting previous skills also represents a difference to the most closely
related previous work [37, 38]. There, to address the computational costs of learning, and the costs of
measuring learning progress, computationally powerful encoders and problem solvers [36, 38] (1997-2002)
are implemented as two very general, co-evolving, symmetric, opposing modules called the right brain 
and the left brain. Both are able to construct self-modifying probabilistic programs written in a universal 
programming language. An internal storage for temporary computational results of the programs is viewed 
as part of the changing environment. Each module can suggest experiments in the form of probabilistic 
algorithms to be executed, and make predictions about their effects, betting intrinsic reward on their out- 
comes. The opposing module may accept such a bet in a zero-sum game by making a contrary prediction, 
or reject it. In case of acceptance, the winner is determined by executing the experiment and checking its 
outcome; the intrinsic reward eventually gets transferred from the surprised loser to the confirmed winner. 
Both modules try to maximize reward using a rather general RL algorithm (the so-called success-story 
algorithm SSA [48]) designed for complex stochastic policies (alternative RL algorithms could be plugged
in as well). Thus both modules are motivated to discover novel algorithmic patterns/compressibility (=
surprising wow-effects), where the subjective baseline for novelty is given by what the opponent already
knows about the (external or internal) world's repetitive patterns. Since the execution of any computational 
or physical action costs something (as it will reduce the cumulative reward per time ratio), both modules are 
motivated to focus on those parts of the dynamic world that currently make surprises and learning progress 
easy, to minimize the costs of identifying promising experiments and executing them. The system learns a 
partly hierarchical structure of more and more complex skills or programs necessary to solve the growing 
sequence of self-generated tasks, reusing previously acquired simpler skills where this is beneficial.
Experimental studies exhibit several sequential stages of emergent developmental sequences, with and without
external reward [37, 38].

However, the previous system [37, 38] did not have a built-in guarantee that it cannot forget previously
learned skills, while PowerPlay as in Section 2 does (and the time and space complexity-based variant
Alg. 7.1 of Section 7.1 can forget only if this improves the average efficiency of previous solutions).

To analyze the novel framework's consequences in practical settings, experiments are currently being 
conducted with various problem solver architectures with different generalization properties. See separate 
papers [53, 52] and Section 8.

9.5 Opposing Forces: Improving Generalization Through Compression, Breaking 
Generalization Through Novelty 

Two opposing forces are at work in PowerPlay. On the one hand, the system continually tries to improve 
previously learned skills, by speeding them up, and by compressing the used parameters of the problem 
solver, reducing its effective size. The compression drive tends to improve generalization performance, 
according to the principles of Occam's Razor and Minimum Description Length (MDL) and Minimum 
Message Length (MML) [50] [11] [57] [58] EI] [26] [JS] [14]. On the other hand, the system also continually 
tries to invent new tasks that break the generalization capabilities of the present solver. 

PowerPlay's time-minimizing search for new tasks automatically manages the trade-off between 
these opposing forces. Sometimes it is easier (because fewer computational resources are required) to 
invent and solve a completely new, previously unsolvable problem. Sometimes it is easier to compress (or 
speed up) solutions to previously invented problems. 
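
The trade-off can be pictured as a single cost-ordered search over two kinds of candidates: inventing a new task, or compressing or speeding up an old solution. The following minimal sketch only illustrates this reading. The `Candidate` class, the labels, and the cost numbers are hypothetical; the framework itself orders candidates through program search by their conditional time and space complexity, not from a fixed table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    kind: str    # "new_task" or "compress" (hypothetical labels)
    name: str    # illustrative description only
    cost: int    # assumed cost of finding and validating the candidate
    valid: bool  # assumed outcome of validation


def first_valid(candidates):
    """Return the cheapest candidate that passes validation, or None."""
    for c in sorted(candidates, key=lambda c: c.cost):
        if c.valid:
            return c
    return None


candidates = [
    Candidate("new_task", "solve a previously unsolvable task", cost=40, valid=True),
    Candidate("compress", "shrink the parameters used for an old task", cost=15, valid=True),
    Candidate("compress", "speed up an old solution", cost=5, valid=False),
]

chosen = first_valid(candidates)
print(chosen.kind, chosen.name)
```

With these invented numbers the cheapest candidate fails validation, so the search settles on the compression step that costs 15. Raising that cost above 40 would make the new task the first validated candidate instead.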

9.6 Relation to Gödel's Sequence of Increasingly Powerful Axiomatic Systems 

In 1931, Kurt Gödel showed that for each sufficiently powerful (ω-)consistent axiomatic system there is a 
statement that must be true but cannot be proven from the axioms through an algorithmic theorem-proving 
procedure [9]. This unprovable statement can then be added to the axioms, to obtain a more powerful 
formal theory in which new formerly unprovable theorems become provable, without affecting previously 
provable theorems. 

In a sense, PowerPlay is doing something similar. Assume the architecture of the solver is a universal 
computer [9, 56]. Its software s can be viewed as a theorem-proving procedure implementing certain enumerable 
axioms and computable inference rules. PowerPlay continually tries to modify s such that the 
previously proven theorems remain provable within certain time bounds, and a new previously unprovable 
theorem becomes provable. 
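
The acceptance condition behind this analogy can be sketched in a few lines. The example below is a toy under strong assumptions: tasks are plain (input, expected output) pairs, solvers are ordinary Python functions, and the time bounds mentioned above are ignored. The names `solves`, `accept`, `old_solver`, `new_solver`, and `forgetful` are hypothetical and do not come from the framework.

```python
def solves(solver, task):
    """Return True if the solver maps the task input to the expected output."""
    x, expected = task
    try:
        return solver(x) == expected
    except Exception:
        return False  # a crashing solver counts as not solving the task


def accept(old_solver, new_solver, repertoire, new_task):
    """Accept a modification only if nothing is forgotten and something is gained."""
    keeps_old = all(solves(new_solver, t) for t in repertoire)
    gains_new = solves(new_solver, new_task) and not solves(old_solver, new_task)
    return keeps_old and gains_new


def old_solver(x):
    # Doubles small inputs only; larger inputs are "unprovable" for it.
    if x < 10:
        return 2 * x
    raise ValueError("input out of range")


def new_solver(x):
    # Doubles every input, so it keeps the old skills and adds a new one.
    return 2 * x


def forgetful(x):
    # Handles the new task but loses the old ones.
    if x >= 10:
        return 2 * x
    raise ValueError("input out of range")


repertoire = [(1, 2), (3, 6)]
new_task = (12, 24)

print(accept(old_solver, new_solver, repertoire, new_task))
print(accept(old_solver, forgetful, repertoire, new_task))
```

The first call accepts the modification, because the old tasks stay solvable and the new one becomes solvable. The second call rejects it, because the candidate forgets the stored repertoire.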

10 Words of Caution 

The behavior of PowerPlay is determined by the nature and the limitations of T, S, P, and its algorithm 
for searching P. If T includes all computable task descriptions, and both S and P allow for implementing 
arbitrary programs, and the search algorithm is a general method for search in program space (SectionlU, 
then there are few limits to what PowerPlay may do (besides the limits of computability [9]). 

It may not be advisable to let a general variant of PowerPlay loose in an uncontrolled situation, 
e.g., on a multi-computer network on the internet, possibly with access to control of physical devices, and 
the potential to acquire additional computational and physical resources (Section 3.1.2) through programs 
executed during PowerPlay. Unlike, say, traditional virus programs, PowerPlay-based systems will 
continually change in a way hard to predict, incessantly inventing and solving novel, self-generated tasks, 
only driven by a desire to increase their general problem-solving capacity, perhaps a bit like many humans 
seek to increase their power once their basic needs are satisfied. This type of artificial curiosity/creativity, 
however, may conflict with human intentions on occasion. On the other hand, unchecked curiosity may 
sometimes also be harmful or fatal to the learning system itself (Section^ — curiosity can kill the cat. 